PreDatA - preparatory data analytics on peta-scale machines

نویسندگان

  • Fang Zheng
  • Hasan Abbasi
  • Ciprian Docan
  • Jay F. Lofstead
  • Qing Liu
  • Scott Klasky
  • Manish Parashar
  • Norbert Podhorszki
  • Karsten Schwan
  • Matthew Wolf
چکیده

Peta-scale scientific applications running on High End Computing (HEC) platforms can generate large volumes of data. For high performance storage and in order to be useful to science end users, such data must be organized in its layout, indexed, sorted, and otherwise manipulated for subsequent data presentation, visualization, and detailed analysis. In addition, scientists desire to gain insights into selected data characteristics ‘hidden’ or ‘latent’ in these massive datasets while data is being produced by simulations. PreDatA, short for Preparatory Data Analytics, is an approach to preparing and characterizing data while it is being produced by the large scale simulations running on peta-scale machines. By dedicating additional compute nodes on the machine as ‘staging’ nodes and by staging simulations’ output data through these nodes, PreDatA can exploit their computational power to perform select data manipulations with lower latency than attainable by first moving data into file systems and storage. Such intransit manipulations are supported by the PreDatA middleware through asynchronous data movement to reduce write latency, application-specific operations on streaming data that are able to discover latent data characteristics, and appropriate data reorganization and metadata annotation to speed up subsequent data access. PreDatA enhances the scalability and flexibility of the current I/O stack on HEC platforms and is useful for data pre-processing, runtime data analysis and inspection, as well as for data exchange between concurrently running simulations.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Scalable I/O and analytics

High-performance computing systems have already approached peta-scale with hundreds of thousands of processors/cores in many deployments. These systems promise a new level of predictive and knowledge discovery ability as researchers gain the capability to model dependencies between phenomena at scales not seen earlier. These applications are highly I/O and data intensive, leading scientists to ...

متن کامل

Chapter 13 HPC Programming Framework

13.2 Research Activities We develop high performance, highly productive software stacks that aim to simplify development of highly optimized, fault-tolerant computational science applications on current and future supercomputers, notably the K computer. Our current focus of work includes large-scale data processing, heterogeneous computing, and fault tolerance. A major ongoing project in our gr...

متن کامل

Position Paper: Using a “Codelet” Program Execution Model for Exascale Machines∗

As computing has moved relentlessly through giga-, tera-, and peta-scale systems, exa-scale (a million trillion operations/sec.) computing is currently under active research. DARPA has recently sponsored the “UHPC” [1] — ubiquitous high-performance computing — program, encouraging partnership with academia and industry to explore such systems. Among the requirements are the development of novel...

متن کامل

Clinical Data Warehousing A Business Analytics approach for managing health data

Heterogeneous health data is a critical issue when managing health information for quality decision making processes. In this paper we examine the efficient aggregation of lifestyle information through a data warehousing architecture lens. We present a proof of concept for a clinical data warehouse architecture that enables evidence based decision making processes by integrating and organising ...

متن کامل

FlexIO: Location-flexible Execution of In Situ Data Analytics for Large Scale Scientific Applications

Increasingly severe I/O bottlenecks on High-End Computing machines are prompting scientists to process simulation output data while simulations are running and before placing data on disk – ”in situ” and/or ”in-transit”. There are several options in placing in-situ data analytics along the I/O path: on compute nodes, on staging nodes dedicated to analytics, or after data is stored on persistent...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2010